CSR Corpus Development
Abstract
The CSR (Connected Speech Recognition) corpus represents a new DARPA speech recognition technology development initiative to advance the state of the art in CSR. This corpus essentially supersedes the now old Resource Management (RM) corpus that has fueled DARPA speech recognition technology development for the past 5 years. The new CSR corpus supports research on major new problems including unlimited vocabulary, natural grammar, and spontaneous speech. This paper presents an overview of the CSR corpus, reviews the definition and development of the "CSR pilot corpus", and examines the dynamic challenge of extending the CSR corpus to meet future needs.
Related Papers
CSR Data Collection Pilot
The objective of the CSR Corpus Development is to collect and deliver a large corpus of continuous speech data to support DARPA research efforts in continuous speech recognition (CSR). The CSR corpus is intended to be task independent and to consist of speech that is similar to that which would be expected from eventual users of real world CSR systems. Toward these ends, the current pilot colle...
Session 11: Continuous Speech Recognition And Evaluation I
This was the first of two companion sessions which marked an important transition in the continuous speech recognition (CSR) component of the DARPA Spoken Language Program. Since 1987, DARPA CSR systems have been developed and evaluated on the Resource Management (RM) CSR corpus, which has become a de facto standard for comparison of speech recognizers, widely accepted and used both within a...
Collection and Analyses of WSJ-CSR Data at MIT
Recently, the DARPA community started a new data collection initiative in the Wall Street Journal (WSJ) domain to support research and development of very large vocabulary continuous speech recognition (CSR) systems. Since August 1991, our group has actively participated in the development of the WSJ-CSR corpus. The purpose of this paper is to document our involvement in this process, from reco...
NIST-DARPA Interagency Agreement: Spoken Language Program
1. To coordinate the design, development and distribution of speech and natural language corpora for the DARPA Spoken Language research community. 2. To design, coordinate implementation, and analyze results, of performance assessment "benchmark tests" for DARPA's speech recognition and spoken language understanding systems. 1. Completed production of the six-CD-ROM-set for ATIS0, and made this...
Building and Incorporating Language Models for Persian Continuous Speech Recognition Systems
This paper describes building statistical language models for Persian from a corpus and incorporating them into a Persian continuous speech recognition (CSR) system. We used the Persian Text Corpus to build the language models. First, we preprocessed the corpus texts by normalizing the differing orthographies of words. Also, the number of POS tags was decreased by clustering POS tag...